A Comparative Study of the Effect of Word Segmentation On Chinese Terminology Extraction
Authors
Abstract
Automatic term extraction is the first step towards automatic or semi-automatic updating of an existing domain knowledge base. Most prior research applies word segmentation as a preprocessing step for Chinese term extraction. However, segmentation ambiguity is unavoidable, especially when identifying unknown words in Chinese. In this paper, we discuss the effect and limitations of segmentation for Chinese terminology extraction. A detailed study shows that errors propagated from word segmentation have a great impact on the results of terminology extraction. Our analysis and experiments show that character-based terminology extraction yields much better results than extraction that uses segmentation as a preprocessing step.
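The error-propagation argument in the abstract can be illustrated with a small sketch. It is not the paper's method: the function names, the toy sentence, and the two hand-written segmentations are assumptions chosen only to show how a segmentation boundary placed inside a term can make that term unreachable for word-based candidate generation, while character-based candidate generation still finds it.

```python
def char_ngrams(text, max_len):
    """Return every contiguous character substring of length 1..max_len."""
    return {
        text[i:i + n]
        for n in range(1, max_len + 1)
        for i in range(len(text) - n + 1)
    }


def word_ngrams(words, max_words):
    """Return every concatenation of 1..max_words adjacent segmented words."""
    return {
        "".join(words[i:i + n])
        for n in range(1, max_words + 1)
        for i in range(len(words) - n + 1)
    }


# Toy example (hypothetical data): the term "生命起源" ("origin of life")
# inside a sentence with a well-known segmentation ambiguity.
sentence = "研究生命起源"
term = "生命起源"

# Two hand-written segmentations standing in for a segmenter's output.
good_segmentation = ["研究", "生命", "起源"]
bad_segmentation = ["研究生", "命", "起源"]  # boundary falls inside the term

found_by_chars = term in char_ngrams(sentence, max_len=4)
found_with_good_seg = term in word_ngrams(good_segmentation, max_words=3)
found_with_bad_seg = term in word_ngrams(bad_segmentation, max_words=3)

print(found_by_chars, found_with_good_seg, found_with_bad_seg)
```

In this toy case the character-based candidates contain the term regardless of segmentation, whereas the word-based candidates contain it only when the segmenter happens to place its boundaries at the term's edges. A real extractor would also need to score and filter the candidates, which this sketch leaves out.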
Related resources
Improving Patent Translation using Bilingual Term Extraction and Re-tokenization for Chinese-Japanese
Unlike European languages, many Asian languages such as Chinese and Japanese do not mark word boundaries typographically in their writing systems. Word segmentation (tokenization), which breaks sentences down into individual words (tokens), is normally treated as the first step for machine translation (MT). For Chinese and Japanese, different rules and segmentation tools lead to different segmentation results in differe...
Joint Learning of Chinese Words, Terms and Keywords
Previous work often used a pipelined framework in which Chinese word segmentation is followed by term extraction and keyword extraction. Such a framework suffers from error propagation and cannot leverage information from later modules in earlier components. In this paper, we propose a four-level Dirichlet Process based model (DP-4) to jointly learn the word distributions from the corpus, domain ...
Comparative Effect of Visual and Auditory Teaching Techniques on Retention of Word Stress patterns: A Case Study of English as a Foreign Language Curriculum in Iran
This study investigated the effect of visual (Cuisenaire Rods) and auditory (nonsensical monosyllables, using Pratt speech processing software) teaching techniques on the retention of word stress. To this end, 60 high school participants formed the study's two experimental groups, each with 30 students, on the basis of their proficiency scores on the KET (Key English Test). In one experime...
Chinese Language IR based on Term Extraction
In this paper, we describe the core technology and modules used in the Chinese Language Information Retrieval System of LIT (formerly KRDL). The system mainly comprises automatic term extraction from Chinese documents, query analysis based on the terms, and finally, measurement of the association between queries and documents. Compared with other methods, we try to use automatically acquired terms...
Word Boundary Decision with CRF for Chinese Word Segmentation
Chinese word segmentation systems must be both accurate and fast for real applications. In this paper, we study the word boundary decision (WBD) approach to Chinese word segmentation and implement it as 2-tag character tagging with a conditional random field (CRF). With the help of tag transition features, the WBD with CRF segmentation approach can achieve comparative performances co...